Detecting gross alignment errors in the Spoken British National Corpus

Authors

  • Ladan Baghai-Ravary
  • Sergio Grau
  • Greg Kochanski
Abstract

The paper presents methods for evaluating the accuracy of alignments between transcriptions and audio recordings. The methods have been applied to the Spoken British National Corpus, which is an extensive and varied corpus of natural unscripted speech. Early results show good agreement with human ratings of alignment accuracy. The methods also provide an indication of the location of likely alignment problems; this should allow efficient manual examination of large corpora. Automatic checking of such alignments is crucial when analysing any very large corpus, since even the best current speech alignment systems will occasionally make serious errors. The methods described here use a hybrid approach based on statistics of the speech signal itself, statistics of the labels being evaluated, and statistics linking the two.

Similar resources

Vague Language and Interpersonal Communication: An Analysis of Adolescent Intercultural Conversation

This paper is concerned with the analysis of the spoken language of teenagers, taken from a newly developed specialised corpus, the British and Taiwanese Teenage Intercultural Communication Corpus (BATTICC). More specifically, the study employs a discourse analytical approach to examine vague language in an intercultural context among a group of British and Taiwanese adolescents, paying particul...

Detecting Annotation Errors in Spoken Language Corpora

Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other...

A Corpus-based Analysis of Collocational Errors in the Iranian EFL Learners' Oral Production

Collocations are one of the areas generally considered problematic for EFL learners. Iranian learners of English, like other EFL learners, face various problems in producing oral collocations. An analysis of learners' spoken interlanguage indicates both the scope of the problem and the need for learners to spend more time and energy on mastering collocations. The present study specifically f...

Introduction: Compiling and analysing the Spoken British National Corpus 2014

For over twenty years, the British National Corpus has been one of the most widely known and used corpora. It is almost impossible to attend an international corpus linguistics conference such as Corpus Linguistics, ICAME (International Computer Archive of Modern and Medieval English), AACL (American Association for Corpus Linguistics) or APCLC (Asia Pacific Corpus Linguistics Conference) witho...

Towards Detecting Annotation Errors in Spoken Language Corpora

The issue: Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), only recently has there been some work in detecting errors in synt...

Journal:
  • CoRR

Volume: abs/1101.1682

Publication date: 2011